Abstract. We analyse a maximum-likelihood approach for combining phylogenetic
trees into a larger 'supertree'. This is based on a simple exponential
model of phylogenetic error, which ensures that ML supertrees have a simple
combinatorial description (as a median tree, minimising a weighted sum of
distances to the input trees). We show that this approach to ML supertree
reconstruction is statistically consistent (it converges on the true species
supertree as more input trees are combined), in contrast to the widely used MRP
method, which we show can be statistically inconsistent under the exponential
error model. We also show that this statistical consistency extends to an ML
approach for constructing species supertrees from gene trees. In this setting,
incomplete lineage sorting (due to coalescence rates of homologous genes being
lower than speciation rates) has been shown to lead to gene trees that
are frequently different from species trees, and this can confound efforts to
reconstruct the species phylogeny correctly.



1. Introduction 

Combining trees on different, overlapping sets of taxa into a parent 'supertree' is 
now a mainstream strategy for constructing large phylogenetic trees. The literature 
on supertrees is growing steadily: new methods of supertree reconstruction are
being developed (Cotton and Wilkinson, 2007) and supertree analyses are shedding
light on fundamental evolutionary questions (Bininda-Emonds et al., 2007).
Despite this surge in research activity, it is probably fair to say that biologists are
still confused about what supertrees really are and what it is we do when we build 
a supertree. Are we, as some maintain, simply summarising the phylogenetic in- 
formation contained in a group of subtrees? Or are we trying to derive the best 
estimate of phylogeny given the information at hand? Nor is it clear which of these 
two conceptually different objectives underpins the various supertree reconstruction
methods. 



We take the view that what biologists really want a supertree reconstruction 
method to deliver is the best hypothesis of evolutionary relationships that can be 
inferred from the data available. Obviously, it is not the case that the supertree con- 
structed as a summary statistic will necessarily be the best estimate of phylogeny. 
Nonetheless, if we are prepared to consider supertree reconstruction a problem of
phylogenetic estimation, we have at our disposal an arsenal of phylogenetic tools
and methods that have been tried and tested. Matrix Representation with Parsimony
(MRP; Baum and Ragan, 1992), Matrix Representation with Compatibility
(MRC; Rodrigo, 1996; Ross and Rodrigo, 2004) and, most recently, Bayesian
supertree reconstruction (BSR; Ronquist et al., 2004) are undoubtedly inspired by
standard phylogenetic methods. A gap remains, though, as there has been remarkably
little development of likelihood-based methods for supertree reconstruction.



In this paper, we analyse one approach to obtain maximum-likelihood (ML) 
estimates of supertrees, based on a probability model that permits 'errors' in subtree 
topologies. The approach is of the type described by Cotton and Page (2004), and 
it permits supertrees to be estimated even if there is topological conflict amongst 
the constituent subtrees. We show that ML estimates of supertrees so obtained 
are statistically consistent under fairly general conditions. By contrast, we show 
that MRP may be inconsistent under these same conditions. We then consider 
a further complication that arises in the supertree setting when combining gene 
trees into species trees - in addition to the possibility that the input gene trees 
are reconstructed incorrectly (either a consequence of the reconstruction method 
used, or some sampling error), there is a further stochastic process that leads to 
the (true) gene trees differing from their underlying species tree (a consequence of 
incomplete lineage sorting). Although simple majority-rule approaches (and gene 
concatenation) have recently been shown to be misleading, we show that an ML 
supertree approach for combining gene trees is also statistically consistent. 



1.1. Terminology. Throughout this paper, unless stated otherwise, phylogenetic 
trees may be either rooted or unrooted, and we will mostly follow the notation of 
Semple and Steel (2003). In particular, given a (rooted or unrooted) phylogenetic 
tree T on a set X of taxa (which will always label the leaves of the tree), any 
subset Y of X induces a phylogenetic tree on taxon set Y, denoted $T|Y$, which,
informally, is the subtree of T that connects the taxa in Y only. In the supertree
problem, we have a sequence $\mathcal{P} = (T_1, T_2, \dots, T_k)$ of input trees, called a profile,
where $T_i$ is a phylogenetic tree on taxon set $X_i$. We wish to combine these trees into
a phylogenetic tree T on the union of the taxon sets (i.e. $X = X_1 \cup X_2 \cup \cdots \cup X_k$).
We assume that the trees in $\mathcal{P}$ are either all rooted or all unrooted, and that T is
rooted or unrooted accordingly. We will mostly assume that trees are fully-resolved
(i.e. binary trees, without polytomies); in Section 5 we briefly describe how this
restriction can be lifted. 

A special case of the supertree problem arises when the taxon sets of the input 
trees are all the same ($X_1 = X_2 = \cdots = X_k$). This is the much studied consensus
tree problem. In an early paper McMorris (1990) described how, in this consensus 
setting, the majority rule consensus tree can be given a maximum likelihood inter- 
pretation. However this approach is quite different to the one described here (even 
when restricted to the consensus problem). 

In this paper, we will denote the underlying ('true') species tree as $T_0$ (assuming
that such a tree exists and that the evolution of the taxa has not involved reticulate
processes such as the formation of hybrid taxa). In an ideal world, we would like
$T_i = T_0|X_i$ for each tree $T_i$ in the profile - that is, we would like each of the
reconstructed trees to be identical to the subtree of the 'true' tree for the taxa in
$X_i$. But in practice, the trees $T_1, \dots, T_k$ are unlikely to even be compatible (i.e. no
phylogenetic tree T exists for which $T_i = T|X_i$ for all i).



2. An exponential model of phylogenetic error 



Species trees that have been inferred from data may differ from the true under- 
lying species tree for numerous reasons, including sampling effects (short and/or 
site-saturated sequences, or poorly defined characters), model violation, sequencing 
or alignment errors, and so forth. In this section, we will assume a simple model 
of phylogenetic error in which the probability of observing a given tree falls off 
exponentially with its 'distance' from an underlying generating tree (e.g. the true 
species tree $T_0$). This type of model has been described by Holmes (2003), albeit
from a different perspective. Suppose d is some metric on resolved phylogenetic
trees. In the exponential model, the probability, denoted $\mathbb{P}_T[T']$, of reconstructing
any species tree $T'$ on taxon set Y, when T is the generating tree (on taxon set
X), is proportional to an exponentially decaying function of the distance from $T'$
to $T|Y$. In other words,

(1)   $\mathbb{P}_T[T'] = \alpha \exp(-\beta\, d(T', T|Y))$.

The constant $\beta$ can vary with Y and other factors (such as the quality of the data);
for example, trees constructed from long high-fidelity sequences are likely to have a
larger $\beta$ than trees constructed from shorter and/or noisier sequences. The constant
$\alpha$ is simply a normalising constant to ensure that $\sum_{T'} \mathbb{P}_T[T'] = 1$, where the sum
is over all fully resolved phylogenetic trees $T'$ on taxon set Y. When we have a
sequence $(X_1, X_2, \dots)$ of subsets of X, we will reflect the dependence of $\alpha, \beta$ on $X_i$
by writing $\alpha_i$ and $\beta_i$. Note that $\alpha_i$ is determined entirely by $\beta_i$ and $|X_i|$.

Note that, implicit in (1), the probability of $T'$ depends only on the subtree of T
connecting the species in $T'$ and not on the other species in T that are not present
in $T'$.

Now, suppose we observe the profile of trees $\mathcal{P} = (T_1, T_2, \dots, T_k)$ as above, where
$T_i$ has leaf set $X_i$. Assume that, for each i, the tree $T_i$ has been independently
sampled from the exponential distribution (1) with $\beta = \beta_i$. Select a phylogenetic
X-tree T that maximises the probability of generating the observed profile $\mathcal{P}$ (we
call this type of tree T an ML supertree for $\mathcal{P}$). In the special case where d is
the nearest-neighbor interchange (NNI) metric, and the $\beta_i$ values are all equal, this
ML supertree was described by Cotton and Page (2004). An ML supertree has
a simple combinatorial description as a (weighted) median tree, as the following
result shows.

Proposition 2.1. For any metric d on phylogenetic trees, an ML supertree for a
profile $\mathcal{P}$ is precisely a tree T that minimises the weighted sum:

    $\sum_{i=1}^{k} \beta_i\, d(T_i, T|X_i)$.

Proof. By the independence assumption,

    $\mathbb{P}_T[(T_1, T_2, \dots, T_k)] = \prod_{i=1}^{k} \mathbb{P}_T[T_i]$,

and, by (1), $\mathbb{P}_T[T_i] = \alpha_i \exp(-\beta_i d(T_i, T|X_i))$. Consequently, $\mathbb{P}_T[(T_1, T_2, \dots, T_k)]$ is
proportional to

    $\exp\left(-\sum_{i=1}^{k} \beta_i d(T_i, T|X_i)\right)$,

and this is maximised for any tree T that minimises $\sum_{i=1}^{k} \beta_i d(T_i, T|X_i)$. This
completes the proof. □

Notice that in the consensus tree setting, and where d is the symmetric difference
(Robinson-Foulds) metric, the consensus of the ML supertrees is the same as the
usual majority rule consensus tree. This follows from earlier results by Barthélemy
and McMorris (1986) (see also Cotton and Wilkinson, 2007).
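
As an illustration of Proposition 2.1, the following Python sketch scores candidate supertrees by the weighted sum of distances to the input trees. It is only a sketch under stated assumptions: trees are encoded as sets of non-trivial splits, d is taken to be the Robinson-Foulds metric, and the search runs over a short hand-picked list of candidates rather than over all X-trees. The function names, the toy profile and the $\beta_i$ values are hypothetical and are not taken from the paper.

```python
def split(side, taxa):
    """Canonical (unordered) form of the bipartition side | taxa - side."""
    side = frozenset(side)
    return frozenset({side, frozenset(taxa) - side})


def restrict(tree, subset):
    """Non-trivial splits of T|Y: intersect both sides of each split with Y."""
    subset = frozenset(subset)
    kept = set()
    for s in tree:
        a, b = (side & subset for side in s)
        if len(a) >= 2 and len(b) >= 2:  # drop splits that become trivial
            kept.add(frozenset({a, b}))
    return frozenset(kept)


def rf(t1, t2):
    """Robinson-Foulds distance: number of splits in exactly one tree."""
    return len(t1 ^ t2)


def weighted_score(candidate, profile):
    """sum_i beta_i * d(T_i, T|X_i) for profile entries (T_i, X_i, beta_i)."""
    return sum(beta * rf(tree, restrict(candidate, taxa))
               for tree, taxa, beta in profile)


X = "abcde"
# Hypothetical candidate supertrees on X (each has two non-trivial splits).
candidates = {
    "T_A": frozenset({split("ab", X), split("de", X)}),
    "T_B": frozenset({split("ac", X), split("de", X)}),
    "T_C": frozenset({split("ab", X), split("ce", X)}),
}
# Hypothetical profile of quartet trees with reliabilities beta_i.
profile = [
    (frozenset({split("ab", "abcd")}), "abcd", 1.0),
    (frozenset({split("bc", "bcde")}), "bcde", 1.0),
    (frozenset({split("ac", "abcd")}), "abcd", 0.5),
]

scores = {name: weighted_score(t, profile) for name, t in candidates.items()}
best = min(scores, key=scores.get)
print(scores, best)
```

In this toy profile the down-weighted conflicting quartet ($\beta = 0.5$) is outvoted, and T_A attains the smallest weighted sum among the three candidates. A real search would range over all X-trees, which this sketch does not attempt.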



3. Statistical consistency of ML supertrees under the exponential model

Is the ML procedure statistically consistent as the number (k) of trees in the 
profile grows? More precisely, under what conditions is the method guaranteed to 
converge on the underlying generating tree $T_0$ as we add more trees to the analysis?
The problem is slightly different from other settings (such as the consistency of ML 
for tree reconstruction from aligned sequence data) where one has a sequence of i.i.d. 
random variables. In the supertree setting, it is perhaps unrealistic to expect that 
the data-sets are generated according to an identical process, since the sequence of 
subsets $X_1, X_2, \dots$ of X is generally deliberately selected.

To formalise the statistical consistency question in this setting, let $X_1, X_2, \dots$
be a sequence of subsets of X. It is clear that the $X_i$'s must cover X in some 'reasonable'
way in order for the ML supertree method to be consistent - for example,
if some taxon is not present in any Xi, or is present in only a small number of 
input trees, then we cannot expect the location of this taxon in any supertree to 
be strongly supported. 

Thus, we will assume that the sequence of subsets of X satisfies the following
covering property: For each subset Y of taxa from X of size m (where m = 3 for
rooted trees or m = 4 for unrooted trees), the proportion of subsets $X_i$ that contain
Y has strictly positive support as the sequence length (of subsets) increases. More
formally, for each such subset Y of X we assume there is some $\epsilon > 0$ and some K
sufficiently large for which:

(2)   $\frac{1}{k} |\{i \le k : Y \subseteq X_i\}| > \epsilon$ for all $k > K$.

If a subset of taxa, Y, is only found in one or a few trees, and is never seen again 
in trees that are subsequently added, this property will not hold. 
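
For a finite collection of taxon sets, the proportion appearing in (2) can be computed directly. The sketch below does this for every m-subset Y. It can only report the proportions for the sets supplied and cannot verify the asymptotic condition itself; the function names and the example taxon sets are hypothetical.

```python
from itertools import combinations


def covering_fractions(taxon_sets, all_taxa, m):
    """For each m-subset Y of all_taxa, the fraction of taxon sets containing Y."""
    sets = [frozenset(s) for s in taxon_sets]
    k = len(sets)
    return {
        Y: sum(1 for s in sets if frozenset(Y) <= s) / k
        for Y in combinations(sorted(all_taxa), m)
    }


def min_coverage(taxon_sets, all_taxa, m):
    """Smallest fraction over all m-subsets; 0 means some Y is never covered."""
    return min(covering_fractions(taxon_sets, all_taxa, m).values())


# Unrooted case (m = 4) on five taxa.
with_full_set = ["abcd", "bcde", "abce", "abcde"]
without_full_set = ["abcd", "bcde", "abce"]
print(min_coverage(with_full_set, "abcde", 4))     # 0.25
print(min_coverage(without_full_set, "abcde", 4))  # 0.0
```

In the second collection the quartet {a, b, d, e} never appears inside any taxon set, so no sequence built by repeating these three sets could satisfy the covering property.
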
Now, suppose we sample a random tree $T_i$ on leaf set $X_i$ according to the exponential
distribution (1). Let $\mathcal{P}_k = (T_1, \dots, T_k)$ be the resulting profile of independently
sampled trees. The following theorem establishes the statistical consistency
of ML supertrees under the exponential model, when the covering property
holds.

Theorem 3.1. Given a sequence $X_1, X_2, \dots$ which satisfies the covering property
(2), consider a profile $\mathcal{P}_k = (T_1, \dots, T_k)$, where $T_i$ is generated independently
according to the exponential model (1) with $\beta = \beta_i$ and with generating tree $T_0$.
Suppose that $\beta_i > \delta > 0$ for all i. Then the probability that $\mathcal{P}_k$ has a unique ML
supertree and that this tree is $T_0$ tends to 1 as $k \to \infty$.

Proof. To establish the theorem, using Proposition 8.1 (stated and proved in the
Appendix) it is enough to specify, for each choice of distinct resolved phylogenetic
X-trees $T_0$ and T, a sequence of events $E_k$ (dependent on $\mathcal{P}_k$) for which, as k
grows, $E_k$ has a probability that tends to 1 under the distribution obtained from
$T_0$ and tends to 0 under the distribution obtained from T. Since T differs from $T_0$,
a subset Y exists of size m (= 3 for rooted trees and = 4 for unrooted) for which
$T|Y \ne T_0|Y$. Notice that the covering property (2) implies that

(3)   $\frac{1}{k} |\{i \le k : T|X_i \ne T_0|X_i\}| > \epsilon$ for all $k > K$.

Let $E_k$ be the event that, among all those $i \in \{1, \dots, k\}$ for which $T|X_i \ne T_0|X_i$,
we have $T_i = T_0|X_i$ more often than $T_i = T|X_i$. Now, for a profile generated by $T_0$
according to the exponential model (1), we have

    $\mathbb{P}_{T_0}[T_i = (T_0|X_i)] = \alpha_i \exp(-\beta_i \cdot 0) = \alpha_i$,

and for each i for which $T|X_i \ne T_0|X_i$, we also have

    $\mathbb{P}_{T_0}[T_i = (T|X_i)] \le \alpha_i \exp(-\delta\, d(T|X_i, T_0|X_i))$.

In particular, for each i for which $T|X_i \ne T_0|X_i$,

(4)   $\mathbb{P}_{T_0}[T_i = (T_0|X_i)] > (1 + \eta)\, \mathbb{P}_{T_0}[T_i = (T|X_i)]$ for some $\eta > 0$.

Similarly, for a profile generated by T according to the exponential model (1), and
for each i for which $T|X_i \ne T_0|X_i$, we have

(5)   $\mathbb{P}_T[T_i = (T|X_i)] > (1 + \eta)\, \mathbb{P}_T[T_i = (T_0|X_i)]$ for some $\eta > 0$.

By condition (3), there is a positive limiting proportion ($\epsilon > 0$) of i for which
$T|X_i \ne T_0|X_i$; therefore inequalities (4) and (5) imply that event $E_k$ has a probability
tending to 1 (or 0) as $k \to \infty$ for a profile generated by $T_0$ (or T respectively),
as required. Statistical consistency of ML now follows by Proposition 8.1. □
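
The behaviour described by Theorem 3.1 can be seen in a small simulation. The sketch below is a deliberately minimal special case, not the general setting of the theorem: every $X_i$ is the same set of four taxa, all $\beta_i$ are equal, and d is assumed to take the value 0 or 1 on the three unrooted quartet topologies. The function names, the seed and the parameter values are hypothetical.

```python
import math
import random
from collections import Counter

TOPOLOGIES = ("ab|cd", "ac|bd", "ad|bc")  # the three unrooted quartet trees


def exponential_probs(generating, beta):
    """P_T[T'] = alpha * exp(-beta * d(T', T)) with d = 0 if equal, else 1."""
    weights = {t: math.exp(-beta * (t != generating)) for t in TOPOLOGIES}
    total = sum(weights.values())  # equals 1 / alpha
    return {t: w / total for t, w in weights.items()}


def sample_profile(generating, beta, k, rng):
    """Draw k independent trees from the exponential model."""
    probs = exponential_probs(generating, beta)
    return rng.choices(list(probs), weights=list(probs.values()), k=k)


def ml_trees(profile):
    """With equal beta and d in {0, 1}, minimising sum_i d(T_i, T) is the
    same as picking the most frequent topology."""
    counts = Counter(profile)
    top = max(counts.values())
    return [t for t in TOPOLOGIES if counts[t] == top]


rng = random.Random(0)  # fixed seed so the run is repeatable
for k in (10, 100, 5000):
    print(k, ml_trees(sample_profile("ab|cd", 1.0, k, rng)))
```

For small k the most frequent topology may be wrong or tied; as k grows, the generating topology is recovered with probability approaching 1, in line with the theorem.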

4. Relation to MRP and its statistical inconsistency 

As shown recently by Bruen and Bryant (2007), there is a close analogy between 
MRP (Matrix Representation with Parsimony) and consensus tree methods which 
seek a median tree computed using the SPR (subtree prune and regraft) or TBR 
(tree bisection and reconnection) metric d (recall that a median tree for a profile
$\mathcal{P} = (T_1, \dots, T_k)$ of trees that all have the same leaf set X is a tree T that minimises
the sum $\sum_{i=1}^{k} d(T, T_i)$; cf. Proposition 2.1). However, the result from Bruen and
Bryant (2007) does not guarantee that MRP produces an ML supertree even when
$\beta_i = 1$ for all i.

We turn now to the question of the statistical consistency of MRP under the 
exponential model (1). It can be shown that MRP will be statistically consistent
under the covering property (2) in some special cases. Two such cases that can be
formally established (details omitted) are: (i) when all the subsets $X_i$ are of size
4, or (ii) when $\beta_i$ is sufficiently large (in relation to $|X|$). However, in general, we
have the following result. 

Theorem 4.1. A $\beta > 0$ exists for which MRP is statistically inconsistent even
in the special ('consensus') case where, for all i, $X_i$ is the same set of six taxa
and $\beta_i = \beta$. More precisely, for this value of $\beta$ and with unrooted fully-resolved
phylogenetic trees on these (equal) taxon sets, the probability that $T_0$ is an MRP
tree (for a profile of trees generated under (1)) converges to 0 as k tends to infinity.



Proof. For two unrooted fully-resolved phylogenetic X-trees $T, T'$ let $L(T, T')$ denote
the total parsimony score on $T'$ of the sequence of splits of T. That is,

(6)   $L(T, T') = \sum_{\sigma \in \Sigma(T)} l(\sigma, T')$,

where $\Sigma(T)$ is the set of splits of T and $l(\sigma, T')$ is the parsimony score of the split
$\sigma$ on $T'$ (treating $\sigma$ as a binary character; Semple and Steel, 2003). For any fully-resolved
phylogenetic X-tree $T'$ let $e(T'; T_0)$ be the expected total parsimony score
on $T'$ of the sequence of splits of a tree T randomly generated by $T_0$ according to
the exponential model (1). Then,

(7)   $e(T'; T_0) = \sum_{T} \alpha \exp(-\beta d(T, T_0)) \cdot L(T, T')$.

To establish Theorem 4.1 it is enough to show, for some $\beta > 0$ and for two unrooted
fully-resolved trees $T_0, T_1$ on $X = \{1, \dots, 6\}$, that $e(T_0; T_0) - e(T_1; T_0) > 0$, since if
$T_0$ is the generating tree, then $T_1$ will be favored over $T_0$ by MRP. We first show
that this can occur when $\beta = 0$. In that case, $\alpha \exp(-\beta d(T, T_0)) = 1/105$ for all T
(there are 105 unrooted fully-resolved phylogenetic trees on X) and so, by (6) and (7),

    $e(T_0; T_0) - e(T_1; T_0) = \frac{1}{105} \sum_{T} (L(T, T_0) - L(T, T_1)) = \frac{1}{105} \sum_{\sigma} n(\sigma)\,(l(\sigma, T_0) - l(\sigma, T_1))$,

where $n(\sigma)$ is the number of unrooted fully-resolved phylogenetic X-trees containing
split $\sigma$ and the summation is over all the splits of $X = \{1, \dots, 6\}$. Now suppose
$T_0$ has a symmetric shape (i.e. an unrooted fully-resolved tree of six leaves with
three cherries) and $T_1$ has a pectinate shape (i.e. an unrooted fully-resolved tree
of six leaves with two cherries). Then, by using earlier results (Steel et al., 1992,
Table 3) and basic counting arguments, it can be shown that

e(%;%) - e(Ti;T ) = — . 
So far, we have assumed that $\beta = 0$; however, $e(T_0; T_0) - e(T_1; T_0)$ is a continuous
function of $\beta$, so a strictly positive value of $\beta$ exists for which

e(To;To)- e (7i;T ) > ±. 

This completes the proof. □ 
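
The counts that enter the $\beta = 0$ calculation in the proof can be checked with a few lines of Python. The sketch below relies on standard counting formulas that the text does not spell out and that are therefore stated here as assumptions: there are $(2n-5)!!$ unrooted binary trees on n labelled leaves and $(2m-3)!!$ rooted binary trees on m labelled leaves, and a tree containing the split A|B corresponds to a rooted tree on A joined by an edge to a rooted tree on B. The function names are hypothetical. The sketch computes only the tree count and $n(\sigma)$, not the parsimony scores.

```python
from math import comb


def double_factorial(n):
    """n!! for odd n, with the convention (-1)!! = 1."""
    result = 1
    while n > 1:
        result *= n
        n -= 2
    return result


def rooted_binary_trees(m):
    """Number of rooted binary trees on m >= 1 labelled leaves: (2m - 3)!!."""
    return double_factorial(2 * m - 3)


def unrooted_binary_trees(n):
    """Number of unrooted binary trees on n >= 3 labelled leaves: (2n - 5)!!."""
    return double_factorial(2 * n - 5)


def trees_containing_split(size_a, size_b):
    """n(sigma) for a split A|B: a rooted tree on A joined to a rooted tree on B."""
    return rooted_binary_trees(size_a) * rooted_binary_trees(size_b)


n = 6
total = unrooted_binary_trees(n)     # 105
n_24 = trees_containing_split(2, 4)  # 15
n_33 = trees_containing_split(3, 3)  # 9
# Each tree on six leaves has n - 3 = 3 non-trivial splits, so summing n(sigma)
# over the 15 splits of type 2|4 and the 10 of type 3|3 must give 3 * 105.
check = comb(6, 2) * n_24 + (comb(6, 3) // 2) * n_33
print(total, n_24, n_33, check == 3 * total)
```

The final consistency check confirms that the two values of $n(\sigma)$ account for every internal edge of every one of the 105 trees.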

An interesting theoretical question is whether a value $s \in (0, 1)$ exists for which
MRP is statistically consistent (for arbitrarily large taxon sets) under the conditions
of Theorem 4.1, whenever $\beta > s$.



5. Technical remarks 



Extension to trees with polytomies. 

We can easily modify the ML process if some of the input trees are not fully-resolved.
For a general phylogenetic tree $t_i$ (possibly with polytomies) on taxon set
$X_i \subseteq X$, and a generating fully-resolved phylogenetic tree T on taxon set X, let
$\phi(t_i|T)$ be the probability of the event that the tree $T_i$ that T generates under the
exponential model is a refinement of $t_i$. More precisely,

    $\phi(t_i|T) = \sum_{T_i \ge t_i} \alpha_i \exp(-\beta_i d(T_i, T|X_i))$,

where $T_i \ge t_i$ indicates that the (fully-resolved) tree $T_i$ contains all the splits
present in $t_i$, and has the same leaf set ($X_i$). Notice that $\phi(t_i|T)$ is not a probability
distribution on phylogenetic trees with the leaf set $X_i$ (its sum is $\ge 1$). Nevertheless,
given a profile $\mathcal{P} = (t_1, \dots, t_k)$ of phylogenetic trees (some or all of which may have
polytomies), we can perform ML to select the tree T that maximises the joint
probability $\prod_{i=1}^{k} \phi(t_i|T)$ of the events $T_i \ge t_i$ for $i = 1, \dots, k$.

An alternative perspective on ML supertrees for certain tree metrics. 

We point out an alternative way of viewing this ML procedure applied to a profile
$\mathcal{P} = (T_1, \dots, T_k)$ when d is one of two well-known metrics on trees (SPR and TBR).
Suppose that we were to extend each tree $T_i$ in $\mathcal{P}$ to a tree $T_i'$ on the full set of
taxa (X). We could regard the placement of those taxa that are 'missing' in $T_i$
(namely the taxa in $X - X_i$) to form a tree $T_i'$ on the full leaf set X to be 'nuisance
parameters' in a maximum likelihood framework (under the exponential model),
and thereby seek to find the tree T and extensions $(T_1', \dots, T_k')$ to maximise the
joint probability:

    $\mathbb{P}_T[(T_1', T_2', \dots, T_k')]$ subject to $T_i = T_i'|X_i$ for all i.

We call such a tree T an extended ML tree for the profile $\mathcal{P}$.

Proposition 5.1. For d = SPR or d = TBR, and any profile $\mathcal{P}$ of fully-resolved,
unrooted phylogenetic trees, the extended ML tree(s) for $\mathcal{P}$ coincides precisely with
the ML tree(s) for $\mathcal{P}$.

Proof. For d = SPR or d = TBR, it can be shown that for any resolved unrooted
phylogenetic trees T on leaf set X, and $T_i$ on leaf set $X_i$, that:

(8)   $\min\{d(T_i', T) : T_i'|X_i = T_i\} = d(T_i, T|X_i)$.

The result now follows by Proposition 2.1. □



Note that Equation (8) does not necessarily hold for other tree metrics (such as
the NNI (nearest-neighbor interchange) or the partition (Robinson-Foulds) metric).



6. Statistical consistency of ML species supertrees from multiple gene trees

A current problem in phylogenetics is how best to infer species trees from gene
trees (Degnan and Rosenberg, 2004; Gadagkar et al., 2005; Liu and Pearl, 2007).

Even in the consensus setting (i.e. when the set of taxa for each gene tree is the 
complete set of taxa under study), Degnan and Rosenberg (2006) have demon- 
strated how incomplete lineage sorting on gene trees can mean that the most likely 
topology for a gene tree can differ from the underlying species tree (for certain
rooted phylogenetic trees on four taxa and for all rooted phylogenetic trees
on five or more taxa). This surprising result implies that simplistic 'majority rule' 
approaches to finding a consensus species tree can be problematic. 

The phenomenon described by Degnan and Rosenberg (2006) is based on the co- 
alescent model for studying lineage sorting in evolving populations. The surprising 
behavior arises only when the effective population sizes and the branch lengths of 
the species tree are in appropriate ranges. Moreover, for 3-taxon trees, the most 
probable gene tree topology always agrees with the species tree topology. Never- 
theless, the fact that larger gene trees can favour an incorrect species tree might 
easily complicate some standard statistical approaches. 

In this section, we show how, despite the phenomena described above (from
Degnan and Rosenberg, 2006), and even in the more general supertree setting
(where some gene trees may have some missing taxa), a maximum likelihood approach
to supertree construction of a species tree from gene trees is statistically
consistent. Moreover, we frame this approach so that it is sufficiently general to 
also allow for error in the reconstruction of the gene trees (as arises under the 
exponential model). 

Consider a model M that has a generating tree topology T as its sole underlying 
parameter. Such a model will typically derive from a more complex model con- 
taining other parameters (such as branch lengths, population sizes and so forth), 
but we will assume that these have a prior distribution and that they have been 
integrated out, so our model has just one parameter - the tree topology. We say 
that M satisfies the property of basic centrality if, for all subsets Y of X of size
m (= 3 for rooted trees and = 4 for unrooted trees), we have:

(9)   $\mathbb{P}_T[T|Y] > (1 + \eta)\, \mathbb{P}_T[T']$
for all trees $T'$ on leaf set Y that are different from $T|Y$, and where $\eta > 0$. For
example, the exponential model (1) satisfies basic centrality, since (9) holds for
all subsets Y of X. For lineage sorting (with a prior distribution on ancestral
population sizes and branch lengths), the property holds, but only because (9)
holds for the subsets Y of X of size 3, as we describe shortly. Firstly, however, we
state a statistical consistency result that extends Theorem 3.1.

Proposition 6.1. Given a sequence $X_1, X_2, \dots$ which satisfies the covering property
(2), consider a profile $\mathcal{P}_k = (T_1, \dots, T_k)$, where $T_i$ is generated independently
according to a model that satisfies the basic centrality property (with $\eta = \eta_i$, and
$\eta_i > \delta > 0$ for all i) and with generating tree $T_0$. Then the probability that $\mathcal{P}_k$ has
a unique ML supertree and that this tree is $T_0$ tends to 1 as $k \to \infty$.



Proof. The proof is similar to the proof of Theorem 3.1, the essential difference
being a modification to the way the events $E_k$ are defined. Given $T_0$ and T (as
in the proof of Theorem 3.1), let $E_k$ be the event that, among those $i \in \{1, \dots, k\}$ for
which $Y \subseteq X_i$, we have $T_i|Y = T_0|Y$ more often than $T_i|Y = T|Y$. Then (as in the
proof of Theorem 3.1), as k grows, $E_k$ has a probability that converges to 1 if $T_0$
is the generating tree, and a probability that converges to 0 if T is the generating
tree. The result now follows by Proposition 8.1. □



We now apply this result in a supertree setting where we have the compounding 
effect of two sources of error: (i) error in using the true gene tree to represent 
the true species tree, due to lineage sorting, and (ii) error in reconstructing the 
correct gene tree, modeled by the exponential model (1). We claim that an ML
approach to inferring a species tree in the presence of these two sources of error is
still statistically consistent (under the coalescent and exponential model (1), and
assuming the covering property), due to the following argument, which justifies the 
basic centrality property. 

Consider a rooted fully-resolved species tree T on X and a rooted fully-resolved
gene tree $T'$ on Y, where Y is a subset of X of size 3 (note that we are here
identifying the taxa in the gene tree with taxa in the species tree). Then the
probability of observing $T'$ under the combination of these two sources of error
(treated independently) from a generating species tree T can be written as

    $\mathbb{P}_T[T'] = \sum_{T''} \mathbb{P}^c_T[T'']\, \mathbb{P}_{T''}[T']$,

where the summation is over the three rooted fully-resolved gene trees $T''$ on taxon set
Y, $\mathbb{P}^c_T[T'']$ is the probability that species tree T gives rise to gene tree $T''$ for the
taxa on Y (under lineage sorting according to the coalescent model), and $\mathbb{P}_{T''}[T']$
is the probability that a generating gene tree $T''$ produces $T'$, as given by the
exponential model (1). Now, considering lineage sorting under the coalescent model,
we have $\mathbb{P}^c_T[T|Y] = \frac{1}{3}(1 + 2\tau)$ for $\tau \in (0, 1)$ while $\mathbb{P}^c_T[T''] = \frac{1}{3}(1 - \tau)$ for the two
other choices of $T'' \ne T|Y$ (see e.g. Rosenberg, 2002; Tajima, 1983). Furthermore,
under the exponential model (1), and assuming, without loss of generality, that d
takes the value 0 or 1 for each pair of 3-taxon trees, we have $\mathbb{P}_{T''}[T''] = \alpha$, while
$\mathbb{P}_{T''}[T'] = \alpha e^{-\beta}$ for the other two choices of $T' \ne T''$ (and $\alpha = (1 + 2e^{-\beta})^{-1}$).
Combining these relationships, we obtain:

    $\mathbb{P}_T[T|Y] = \frac{\alpha}{3}\left(1 + 2\tau + 2(1 - \tau)e^{-\beta}\right)$,

while for the other two choices of $T' \ne T|Y$, we have

    $\mathbb{P}_T[T'] = \frac{\alpha}{3}\left((1 + 2\tau)e^{-\beta} + (1 - \tau)(1 + e^{-\beta})\right)$.

For any given $\beta, \tau > 0$, these last two equations imply that for some $\eta > 0$, and for
the two choices of $T' \ne T|Y$, we have

    $\mathbb{P}_T[T|Y] > (1 + \eta)\, \mathbb{P}_T[T']$

(in fact, routine algebra shows that $\mathbb{P}_T[T|Y] - \mathbb{P}_T[T'] = \alpha\tau(1 - e^{-\beta})$ and, since the
bracketed factor in $\mathbb{P}_T[T']$ is less than 3, that we can take $\eta = \tau(1 - e^{-\beta})$). Taking the value
of $\eta$ that is minimal for all subsets Y of X of size 3 establishes the basic centrality
property in this setting.
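
The two closed forms above can be checked numerically by carrying out the sum over the three intermediate gene trees directly. The following Python sketch does this for a single 3-taxon set, assuming d takes the value 0 or 1 as in the text; the function name and the parameter values $\tau = 0.3$, $\beta = 1.5$ are hypothetical choices made only for illustration.

```python
import math


def triplet_probabilities(tau, beta):
    """Return (P_T[T|Y], P_T[T']) for a reconstructed 3-taxon gene tree,
    combining lineage sorting (parameter tau) with exponential error (beta)."""
    alpha = 1.0 / (1.0 + 2.0 * math.exp(-beta))
    coal_match = (1.0 + 2.0 * tau) / 3.0  # true gene tree equals T|Y
    coal_other = (1.0 - tau) / 3.0        # each of the two other gene trees
    err_same = alpha                      # reconstruction is correct
    err_diff = alpha * math.exp(-beta)    # reconstruction is one given wrong tree
    # Sum over the three possible true gene trees T''.
    p_match = coal_match * err_same + 2.0 * coal_other * err_diff
    p_other = (coal_match * err_diff      # T'' = T|Y, then an error to T'
               + coal_other * err_same    # T'' = T', reconstructed correctly
               + coal_other * err_diff)   # T'' is the third tree, then an error to T'
    return p_match, p_other


tau, beta = 0.3, 1.5
p_match, p_other = triplet_probabilities(tau, beta)
print(p_match, p_other, p_match / p_other)
```

The three probabilities (one for $T|Y$ and two equal values for the other topologies) sum to 1, and the species-tree topology remains the most probable reconstructed triplet whenever $\tau > 0$ and $\beta > 0$.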



7. Discussion 



To develop a likelihood-based supertree reconstruction method, it is necessary to 
define a model that delivers the probability of obtaining a series of subtree topolo- 
gies, given a hypothesised supertree. We have chosen a very simple yet intuitive 
probability function whereby the probability of observing a 'wrong' subtree (i.e. 
one where the topology differs from that of a pruned supertree) decreases exponen- 
tially as its topology becomes increasingly distant from that of the hypothesised 
supertree. Consequently, the ML supertree can be estimated even when the con- 
stituent subtrees have conflicting topological signals. 

Our approach is model-based, but one may reasonably ask whether the model 
described here is a biologically realistic one. We suggest that it is. For one thing, 
we expect, for a variety of reasons, to see conflicts between the topologies of sub- 
trees and the reconstructed supertree. With gene sequences obtained from different 
species, for instance, incomplete lineage sorting and ancestral heterozygosity fre- 
quently lead to differences between gene trees and species trees. Convergent and 
parallel evolution can confound phylogenetic reconstruction, as can long-branch at- 
traction. We have chosen to use the exponential distribution to describe this steady 
decrease in probabilities as the distances between subtrees and supertrees increase. 
The value of using the exponential distribution lies in the ease with which it can 
be manipulated when we compute log-likelihoods. However, we suggest that one 
fruitful research project may be to explore other possible probability distributions 
and, for that matter, other tree-to-tree distance metrics. 

The likelihood framework provides an additional benefit: a rich body of statisti- 
cal and phylogenetic methods already use likelihood. Moreover, statistical consis- 
tency holds for maximum likelihood supertrees under weak conditions, in contrast 
to MRP, which can be inconsistent in some cases. We also show that the ML 
supertree approach developed here provides a statistically consistent strategy for 
combining gene trees even when there is the possibility that these trees may be 
different from the true species tree. An obvious application of ML supertrees will 
be their use in statistical tests of topological hypotheses, and we already know how
to do this with standard ML phylogenies (Goldman et al., 2000).
We also recognise that our particular likelihood implementation is closely related
to the 'Majority-Rule(-) Supertree' construction proposed by Cotton and
Wilkinson (2007). More precisely, when the tree metric is the symmetric difference
(Robinson-Foulds) metric, then the Majority-Rule(-) Supertree is, in effect,
the strict consensus of our ML supertrees. However, the approach in Cotton and 
Wilkinson (2007) is quite different: they show how to extend majority-rule from the 
consensus to the supertree setting. Nonetheless, they converge on the same opti- 
mality criterion that we use, i.e. a supertree that minimises the sum of distances to 
a set of trees. One should not be surprised that the same optimality criterion can 
emerge from different conceptual bases. With standard phylogenetic reconstruction,
choosing the tree that minimises the number of evolutionary changes can be
justified philosophically (with the principle of maximum parsimony), as a consensus
method (Bruen and Bryant, 2007), or by using an explicitly statistical approach
(e.g. likelihood; Steel and Penny, 2000).



We have not discussed algorithms to search for ML supertrees. Instead, we direct 
readers to the discussion in Cotton and Wilkinson (2007), since the criterion we 
use is similar to theirs. 



8. Appendix: Consistency of ML for general (non-i.i.d.) sequences 

Here we describe a convenient way to establish the statistical consistency of 
maximum likelihood when we have a sequence of observations that may not be 
independent or identically distributed. We frame this discussion generally, as the 
result may be useful for other problems. In particular, in this result, we do not need 
to assume the sequence samples are independent (though in our applications, they 
are), nor identically distributed (in our applications, they are not). Suppose we 
have a sequence of random variables $Y_1, Y_2, \dots$ that are generated by some process
that depends on an underlying discrete parameter a that can take values in some
finite set A. In our setting, the $Y_i$'s are trees constructed from different data sets
(e.g. gene trees), while a is the generating species tree topology. We assume that
the model specifies the probability distribution of $(Y_1, \dots, Y_k)$ given (just) a - for
example, in our tree setting this would mean specifying prior distributions on the
branch lengths and other parameters of interest (e.g. ancestral population sizes)
and integrating with respect to these priors. 

Given an actual sequence $(y_1, \dots, y_k)$ of observations, the maximum likelihood
(ML) estimate of the discrete parameter is the value a that maximises the joint
probability

    $\mathbb{P}_a[Y_1 = y_1, \dots, Y_k = y_k]$

(i.e. the probability that the process with parameter a generates $(y_1, \dots, y_k)$).
Now suppose that the sequence $Y_1, \dots, Y_k, \dots$ is generated by $a_0$. We would like the
probability that the ML estimate is equal to $a_0$ to converge to 1 as k increases. If
this holds for all choices of $a_0 \in A$, then ML is statistically consistent. The following
result provides a convenient way to establish this; indeed, it characterises the
statistical consistency of ML.
Proposition 8.1. In the general set-up described above, ML is statistically consistent
if and only if the following condition holds: for any two distinct elements
$a, b \in A$, we can construct a sequence of events $E_1, E_2, \dots$, where $E_k$ is dependent
on $(Y_1, \dots, Y_k)$, for which, as $k \to \infty$:

(i) the probability of $E_k$ under the model with parameter a converges to 1.

(ii) the probability of $E_k$ under the model with parameter b converges to 0.

Proof. The 'only if' direction is easy: Suppose ML is statistically consistent and
$a, b \in A$ are distinct. Let $E_k$ be the event that a is the unique maximum likelihood
estimate obtained from $(Y_1, \dots, Y_k)$. Then $E_k$ satisfies conditions (i) and (ii).

For the converse direction, recall that the variation distance between two probability
distributions p, q on a finite set W is

    $\max_{E \subseteq W} |\mathbb{P}_p(E) - \mathbb{P}_q(E)|$,

where $\mathbb{P}_p(E) = \sum_{w \in E} p(w)$ is the probability of event E under distribution p
(similarly for $\mathbb{P}_q(E)$). This variation distance can also be written as $\frac{1}{2}\|p - q\|_1$,
where $\|p - q\|_1 = \sum_{w \in W} |p(w) - q(w)|$ is the $l_1$ distance between p and q. Thus,
if we let $d^{(k)}(a, b)$ denote the $l_1$ distance between the probability distributions on
$(Y_1, \dots, Y_k)$ induced by a and by b, then conditions (i) and (ii) imply that

(10)   $\lim_{k \to \infty} \frac{1}{2} d^{(k)}(a, b) = 1$.

Now, by the first part (Eqn. 3.1) of Theorem 3.2 of Steel and Székely (2002), the
probability that the ML estimate is the value a that generates the sequence
$(Y_1, \dots, Y_k)$ is at least $1 - \sum_{b \ne a} (1 - \frac{1}{2} d^{(k)}(a, b))$ and so, by (10), this probability
converges to 1 as $k \to \infty$. □
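
The identity between the variation distance and half the $l_1$ distance, used in the proof above, can be checked by brute force on a small example. The sketch below enumerates every event E of a three-point set W; the function names and the two distributions are hypothetical, and the enumeration is only practical for very small W.

```python
from itertools import chain, combinations


def l1_distance(p, q):
    """Sum over w of |p(w) - q(w)|, for distributions given as dicts."""
    support = set(p) | set(q)
    return sum(abs(p.get(w, 0.0) - q.get(w, 0.0)) for w in support)


def variation_distance(p, q):
    """max over all events E of |P_p(E) - P_q(E)|, by enumerating subsets."""
    support = sorted(set(p) | set(q))
    events = chain.from_iterable(
        combinations(support, r) for r in range(len(support) + 1)
    )
    return max(
        abs(sum(p.get(w, 0.0) for w in e) - sum(q.get(w, 0.0) for w in e))
        for e in events
    )


p = {"x": 0.5, "y": 0.3, "z": 0.2}
q = {"x": 0.2, "y": 0.3, "z": 0.5}
print(variation_distance(p, q), 0.5 * l1_distance(p, q))
```

Up to floating-point rounding, both printed values agree; the maximum is attained by any event that collects exactly the points where p exceeds q.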